Assumption

Markov assumption: The future is conditionally independent of the past given the present
\[P(S_{3}|S_0, S_1, S_2) \approx P(S_{3} | S_2)\]
- Generalizing it to \(t\) time steps
\[P(S_{t+1}|S_0, \dots, S_t) \approx P(S_{t+1} | S_t)\]
Example

State space: Represents the unique observations in the world. We have a finite set of possible states we can be in at time \(t\). We have discrete timesteps: \(t = 0, t = 1, \dots\). We can be in only one state at a given time. Here \(S = \{HOT, COLD, WARM\}\).
Initial discrete probability distribution over states: The probability of starting with a particular state.
- \(\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}\)
Transition probability matrix \(T\), where each \(a_{ij}\) represents the probability of moving from state \(s_i\) to state \(s_j\), such that \(\sum_{j=1}^{n} a_{ij} = 1, \forall i\)
\[ T =
\begin{bmatrix}
\text{P(HOT|HOT)} & \text{P(COLD|HOT)} & \text{P(WARM|HOT)}\\
\text{P(HOT|COLD)} & \text{P(COLD|COLD)} & \text{P(WARM|COLD)}\\
\text{P(HOT|WARM)} & \text{P(COLD|WARM)} & \text{P(WARM|WARM)}\\
\end{bmatrix}
=
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\
\end{bmatrix}
\]
Homogeneous Markov chains: Transition probabilities are the same for all \(t\).
- Predicting probabilities of state sequences: e.g., the probability of the sequence HOT, HOT, WARM, COLD
\[\begin{equation}
\begin{split}
P(\textrm{HOT, HOT, WARM, COLD}) =& P(\textrm{HOT}) \times P(\textrm{HOT|HOT})\\
& \times P(\textrm{WARM|HOT})\\
& \times P(\textrm{COLD|WARM})\\
=& 0.5 \times 0.5 \times 0.3 \times 0.1 = 0.0075\\
\end{split}
\end{equation}\]
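The chain-rule computation above can be checked numerically; a minimal sketch, assuming the state order HOT, COLD, WARM from the matrix:

```python
import numpy as np

# Toy weather chain from the notes above (state order HOT, COLD, WARM assumed).
states = ["HOT", "COLD", "WARM"]
idx = {s: i for i, s in enumerate(states)}

pi0 = np.array([0.5, 0.3, 0.2])         # initial distribution pi_0
T = np.array([[0.5, 0.2, 0.3],          # T[i, j] = P(state j | state i)
              [0.2, 0.5, 0.3],
              [0.3, 0.1, 0.6]])
assert np.allclose(T.sum(axis=1), 1.0)  # each row must sum to 1

def sequence_probability(seq):
    """Chain rule + Markov assumption: P(s_0) * prod_t P(s_t | s_{t-1})."""
    p = pi0[idx[seq[0]]]
    for prev, cur in zip(seq, seq[1:]):
        p *= T[idx[prev], idx[cur]]
    return p

print(sequence_probability(["HOT", "HOT", "WARM", "COLD"]))  # 0.5*0.5*0.3*0.1 = 0.0075
```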
- Inference: Probability of being in a particular state at time \(t\), \(\pi_t = \pi_{t-1} \times T\)
- Multiplying the current state probabilities by the transition matrix advances the distribution one time step!
- Stationary distribution: A stationary distribution of a Markov chain is a probability distribution over states that remains unchanged in the Markov chain as time progresses.
- A probability distribution \(\pi\) on states \(S\) is stationary where the following holds for the transition matrix: \(\pi T=\pi\)
- Conditions for stationary distribution
- Sufficient condition for existence/uniqueness is positive transitions: \(P(s_t | s_{t-1}) > 0\)
- Weaker sufficient conditions for existence/uniqueness
- Irreducible: A Markov chain is irreducible if it is possible to get to any state from any state. It does not get stuck in part of the graph.
- Aperiodic: Loosely, a Markov chain is aperiodic if it does not keep cycling through the same sequence of states; see a standard reference for the formal definition.
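Both the update \(\pi_t = \pi_{t-1} T\) and convergence to a stationary distribution can be illustrated by repeated multiplication by \(T\). Since every entry of the example \(T\) is positive, a unique stationary distribution exists; a sketch:

```python
import numpy as np

# Same transition matrix as the example; iterate pi_t = pi_{t-1} @ T.
T = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.5, 0.3],
              [0.3, 0.1, 0.6]])
pi = np.array([0.5, 0.3, 0.2])   # start from pi_0

for _ in range(1000):            # power iteration toward the stationary distribution
    pi_next = pi @ T
    if np.allclose(pi_next, pi, atol=1e-12):
        break
    pi = pi_next

print(pi)                        # satisfies pi @ T ≈ pi
```

All entries of this \(T\) are positive, which is the sufficient condition above, so the iteration converges to the same \(\pi\) from any starting distribution.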
N-gram language models using Markov chains
- Bad model: estimate each conditional probability by counting occurrences of the full history. Most long histories occur rarely or never, so the counts will be tiny and the model will be very sparse.
\[
P(\textrm{In the age of data algorithms have the answer}) = P(\textrm{In}) \times P(\textrm{the|In})
\times P(\textrm{age|In the}) \times P(\textrm{of|In the age})
\times P(\textrm{data|In the age of})
\times P(\textrm{algorithms|In the age of data})
\times P(\textrm{have|In the age of data algorithms})
\dots
\]
- Bigram language model: We assume \(P(\textrm{algorithms|In the age of data}) \approx P(\textrm{algorithms|data})\)
\[P(\textrm{In the age of data algorithms have the answer}) = P(\textrm{In}) \times P(\textrm{the|In})
\times P(\textrm{age|the})
\times P(\textrm{of|age})
\times P(\textrm{data|of})
\times P(\textrm{algorithms|data})
\times P(\textrm{have|algorithms})
\times P(\textrm{the|have})
\times P(\textrm{answer|the})
\]
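A minimal sketch of bigram maximum-likelihood estimates from counts; the single-sentence corpus and whitespace tokenization are assumptions for illustration:

```python
from collections import Counter

# Bigram MLE estimates from a toy corpus.
corpus = "in the age of data algorithms have the answer".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "in"))   # "in" occurs once, always followed by "the" -> 1.0
print(p_bigram("age", "the"))  # "the" occurs twice, followed by "age" once -> 0.5
```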
- Trigram language model: consider the corpus {a rose is a rose}, with vocabulary size \(V = 3\)
- Word trigram model of language (the state is the previous \(n = 2\) words)
- Each state consists of a sequence of two words (bigrams) from the vocabulary.
- State space = {(a a), (a rose), (a is), (rose a), (rose rose), (rose is), (is a), (is rose), (is is)} (\(3^2\) states). Many of these might not occur in the corpus.
- Markov assumption: \(P(s_{t+1}|s_0, \dots s_t) = P(s_{t+1}|s_t)\)
- \(P(\text{(a rose)|(a rose), (rose is), (is a)}) = P(\text{(a rose)}|\text{(is a)})\)
- What transitions are possible?
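One way to answer this: a state \((x, y)\) can only transition to a state of the form \((y, z)\), since consecutive bigrams overlap in one word, and the corpus determines which of those transitions are actually observed. A sketch:

```python
from itertools import product

# State space for the bigram-state (trigram) model on V = {a, rose, is}.
vocab = ["a", "rose", "is"]
state_space = list(product(vocab, repeat=2))
print(len(state_space))         # 3^2 = 9 states

# Structurally possible transitions: (x, y) can only move to (y, z).
possible = {s: [t for t in state_space if t[0] == s[1]] for s in state_space}
print(possible[("a", "rose")])  # [('rose', 'a'), ('rose', 'rose'), ('rose', 'is')]

# Transitions actually observed in the corpus "a rose is a rose":
corpus = "a rose is a rose".split()
states_seq = list(zip(corpus, corpus[1:]))
observed = set(zip(states_seq, states_seq[1:]))
# three observed transitions: (a,rose)->(rose,is), (rose,is)->(is,a), (is,a)->(a,rose)
print(observed)
```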
Some Disadvantages of n-gram models
- In many cases, we can get by with n-gram models. But in general, it is not a good assumption that the next word depends only on the last \(n\) words. For example: "The computer I was talking about yesterday when we were having dinner crashed." Language has long-distance dependencies.
- We can extend it to \(3\)-grams, \(4\)-grams, \(5\)-grams, but then the sparsity problem gets worse.
- Also, n-gram models have huge RAM requirements, since the table of counts grows with the number of distinct n-grams.
- N-grams work well in practice, but they represent context as exact word identities.
- Suppose your training data contains the sequence "feed the cat" but not "feed the dog". The trigram model then gives \(P(\textrm{dog|feed the}) = 0\).
- If we instead represent words with embeddings, e.g., in an RNN, we can generalize to "dog" even though that sequence never appeared in the corpus.
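The zero-probability problem shows up directly in a count-based trigram model; the toy corpus here is an assumption for illustration:

```python
from collections import Counter

# Sparsity of count-based trigram estimates.
corpus = "feed the cat and feed the bird".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(word, w1, w2):
    """MLE: count(w1 w2 word) / count(w1 w2); zero for anything unseen."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, word)] / bigrams[(w1, w2)]

print(p_trigram("cat", "feed", "the"))  # seen once out of two "feed the" -> 0.5
print(p_trigram("dog", "feed", "the"))  # never seen -> 0.0, however plausible
```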